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Abstract. 

The fuhy general calculation of the cosmic error on A^-point cor- 
relation functions and related quantities is presented. More precisely, 
the variance caused by the finite volume, discreteness, and edge effects 
is determined for any estimator which is based on a general function 
of A^-tuples, such as multi-point correlation functions and multi-spectra. 
The results are printed explicitly for the two-point correlation function 
(or power-spectrum), and for the three-point correlation (or bispectrum). 
These are the most popular statistics in the study of large scale structure, 
yet, the a general calculation of their variance has not been performed 
until now. 



1. Introduction 

Astrophysics provides prime examples of random fields, such as the distribution 
of galaxies and the fluctuations of the Cosmic Microwave Background (CMB). 
Such random fields can be characterized statistically by a series of well chosen 
estimators. The most popular ones include counts-in-cells, A-point correlation 
functions, as well as statistics derived from them. Indeed, there are mathemat- 
ical theorems, which state that, under certain conditions, both series describe a 
random process fully. 

Our goal in astrophysics is not simply estimating these statistics, but to 
constrain underlying theories. This aim necessitates a firm control over the ex- 
pected statistical errors from a survey. The theory of errors for finite surveys, 
the "cosmic statistics of statistics" , is therefore of utmost importance for prac- 
tical applications. Such a theory was formulated in the past for counts in cells 
statistics. 

For the A-point correlation functions, however, to date only partial results 
are published, such as calculation of the discreteness effects for the two-point cor- 
relation functional and for A^-point correlation functions for Poisson and multi- 
nomial point processes^, full calculation for the two-point function under the 
hierarchical assumption with edge effects neglected^, and some results in Fourier 
space with various degree of approximations. 
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The aim of this proceedings is to present the general variance calculation 
for A^-point correlation functions with all major contributions included, such as 
discreteness effects, arising from sampling the underlying random field with a 
finite number of galaxies, edge effects, due to the geometry of the survey and the 
corresponding uneven weighting of A^-tuples, and finite volume effects, caused 
by fluctuations at and above the scale of a survey. The underlying technique of 
calculation, as well as the fully general results are presented here; specializations 
such as power spectrum and bispectrum, and approximations, such as weakly 
non-linear perturbation theory, hierarchical assumptions, will be presented else- 
where in collaboration with S. Colombi, and A.S. Szalayp]. 

The next section sets up the general mathematical framework for the cal- 
culation using computer algebra packages. §3 presents the results for N = 2, 
and = 3. The final discussion section outlines how the formulae can be used 
in practice, as well as describes developments in the immediate future. 

2. General Framework 

Many interesting statistics, such as the A^-point correlation functions and their 
Fourier analogs, can be formulated as functions over points in a catalog. The 
covariance of a pair of such estimators will be calculated for a general point 
process under the assumption that the average density is a priori known. This 
is the obvious generalization of the Poisson process when arbitrarily high order 
correlations are present. The number of objects is thus varied corresponding 
to a grand canonical ensemble in statistical physics. The following calculations 
lean heavily on the elegant formalism by Ripley^, which can be consulted for 
details, and are the direct generalization of the framework set up by Szapudi &; 
Szalay.[3 

Let -D be a catalog of data points to be analyzed, and R randomly gener- 
ated over the same area, with averages A, and p respectively. The role of R is 
to perform a Monte Carlo integration compensating for edge effects, therefore 
eventually the limit p — > oo will be taken. A on the other hand is assumed to 
be externally estimated with arbitrary precision. No other assumption is taken 
about the point process. In practice, A is usually estimated from the same sur- 
vey, which gives rise to additional correlations, the "integral constraint bias". 
This effect will be investigated in more detail elsewhere. 

Followinj^, let us define symbolically an estimator D^R'^, with p + q = N 
for a function <I> symmetric in its arguments 

DPR'i = J2Hx^,...,Xp,yu...,yg), (1) 

with Xi 7^ Xj € D,yi ^ yj € R. As an example, the two point correlation 
function corresponds to <^(x,y) = [x,y G D,r < d{x,y) < r + dr], where 
d(x, y) is the distance between the two points, and [condition] equals 1 when 
condition holds, otherwise. Ensemble averages can be estimated via factorial 
moment measures, In the Poisson limit Vg = X^/jg, where ijg is the s di- 

mensional Lebesgue measure, and in the most general case Vsf{xi^ ■ ■ ■ ,Xs)X^^is- 
The function A*/(xi, . . . , Xs) = F{xi, . . . , Xg) can be identified as the full, i.e. 
non-reduced, A^-point correlation function. The connection between these and 
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the reduced A'^-point correlation functions is well known[^, and can be obtained 
through either generating functions, or recursions. 
The general covariance of a pair of estimators is 

E{p,,p,,N,,N,) = {Dl^RfbfRf) = Y: (?)^'(^/) (^/)j!5^+.A-V-^ 

(2) 

with pi + qi = Ni,p2 + q2 = -^2 , S will be specified later, " denotes normalization 
with A, p respectively, i.e. {D = D/ X, R = R/p. The expression simply describes 
the fact that out of the pi and p2 different data points in D we have an z-fold 
degeneracy, as well as a j-fold degeneracy in the random points drawn from R. 
To simplify the exposition of the calculation, it is convenient to assume from 
the very beginning the p 00, i.e. the random process employed for the Monte 
Carlo integration of the shape of the survey is arbitrarily dense. This is usually 
achievable in practice, thus it will not be considered further. The above equation 
describes the cross-correlation of two estimators even for two different objects as 
well: e.g., two particular bins of the two- and three-point correlation functions. 
An interesting special case, A'^i = 1 (the average density in the survey) and 
N2 > 2, is needed for calculating the "integral constraint" correction. 
When the random process is arbitrarily dense only j = survives, 

E{p,,p2,N,,N2) = {{D'^^Rl^bfRf) = (3) 
E (^')^'^"'^^/(l'2,...,Pi,iVi + l,...,iVi+P2-i), 

where S is now an operator acting on /, 

Sk = j ^a{l . ■ ■ Ni)^k{l . . .i,Ni + l, . . . ,Ni + N2 - l) . . . P2N-k- (4) 

The operator is analogous to the phase space integral 5*^0. The dot empha- 
sizes that the integral can be performed only after is acted on / which is part 
of the measure. The phase space has to be calculated in the general case via 
the full factorial moment measure of which / is an integral part. Throughout 
the paper we use the convention that (^) is nonzero only for > 0, Z > 0, and 
k > I, and the variables Xi are denoted with i for simplicity. Here $a and 
denote two different functions, for instance corresponding to two radial bins of 
two estimators of the same or different orders. In the above formula the symme- 
try of $ in its arguments was heavily relied on to achieve the above "standard" 
representation of the operator. 

The dependence of on a, 6, and is not noted for convenience]^ but 
they will be assumed throughout the paper. The estimator]^ for the generalized 
A^-point correlation function is {D — R)^ , or more precisely. 
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where S = J ^fJ-N (without subscript). In this case 5 is a number since it 
corresponds to the Poisson catalog with its simple factorial moment measure. 
The average of the estimator yields 



(6) 



Since the role of $ is effectively a window, with a window operator W this can be 
written symbolically as (wn) = W^n/W. The asymptotic centered covariance 
between two estimators of the above for a general point process in the limit of 
p ^ oo can be written according to the previous considerations as 

{6WN^6WN2) = {Wa,NiWb,N2) " {Wa,Ni) {Wb,N2) = (7) 

Ni\(N2,^ E(ij,Ni,N2)-Sofil,...,i)f{N, + l,...,m+j] 

In the above the operator 5*0 acts on the multiple of the two /'s on the right. The 
above equation is the main result of this paper. While it is quite cumbersome, 
it is easily expandable with the help of computer algebra, as demonstrated by 
the examples of the next section. The special cases rendered will also illustrate 
how the simplicity of the proposed class of estimators exactly manifests itself by 
a mass cancellation of terms. Any other estimator would have extra terms in 
the variance]^ 




3. The cosmic error on the two-and three-point correlation functions 

The above equation was entered into a computer algebra package. The symme- 
tries and simplicity of the estimator give rise to cancellations and possibilities for 
collecting similar terms. This is the reason why the final result for the two-point 
correlation function has only three to six terms, while formally it could have up 
to about a hundred. Alternative estimators, such as DD/RR — 1, etc. would 
not yield significantly less cancellations, therefore error-calculation for them was 
not attempted; although the same formalism applies. 

3.1. The two-point function 

The covariance for the two-point function (or any quantity estimated from dou- 
blets of points, such as the power spectrum) can be expressed after the cancel- 
lations and the possible simplifications as 

{Sw'iSwl) =^{J ^a(l, 2)^3, 4) [^4(1, 2, 3, 4) + 2^(1, 3)^(2, 4)] + (8) 

J J $,(1, 2)$b(l, 3) [e(2, 3) + 6(1, 2, 3)] + 

^ I <i>,(l,2)cl>,(l, 2) [1+^(1,2)]}. 
4 



The above equation is essentially identical to the result of Hamilton^ where 
he calculates the variance of 6, the fluctuation field itself. This is not at all 
surprising.^ The estimator contains exactly the same terms and coefficients 
(regardless of the choice of as 5 itself, which strongly suggests that it is 
(nearly) optimal. 

The above formula contains the different contributions to the error]^ entirely 
mixed. Approximate separation of the different terms appears to be fruitless. 
The only general point to be made is that discreteness effects are absent in the 
first term, while they are present (mixed with the other two effects) in the l/A'^ 
terms, with s > 0. This is naturally true for the higher order calculations as 
well. 

It is worth to emphasize again, that this formula applies for the general- 
ized 2-point function, including the "traditional" 2-point function, and any of 
its incarnations, such as the the power spectrum. In the latter case, esthetic 
or practical reasons might dictate that the errors on the power spectrum are 
expressed in terms of the power-spectrum, bi-, and tri-spectrum, instead of the 
two-, three-, and four-point correlation functions. Since the connection is a 
simple Fourier transform, this trivial exercise is left for the reader. Explicit 
formulae, aimed at practical applications for power-spectrum will be presented 
elsewhere. n 

3.2. The three-point correlation function 

The same method yields (co)variance for higher order estimators as well. We 
present another example, the generalized three-point correlation function. Its 
variance, using again the main result of the proceeding, translates into: 

{SwISwD = (9) 

J cl>a(l,2,3)cl>b(4,5,6)[e(l,2,3,4,5,6) + 

3^(1, 2)^(3, 4, 5, 6) + 9^(1, 4)^(2, 3, 5, 6) + 
3^(4, 5)^(1, 2, 3, 6) + 9^(1, 5, 6)^(2, 3, 4) + 
9^(1, 4)^(2, 3)^(5, 6) + 6^(1, 4)^(2, 5)^(3, 6)] 

J I ^a{l, 2, 3)$fe(l, 4, 5) [e(l, 2, 3, 4, 5) + 

e(2, 3, 4, 5) +2^(1, 2)^(3, 4, 5) + 
2^(1,4)^(2,3,5) +e(2,3)e(l, 4, 5) + 
4^(2,5)^(1,3,4) +e(4,5)e(l, 2, 3) + 
e(2, 3)^(4, 5) +2^(2,4)^(3, 5)] + 
18 f 

^ J $a(l,2,3)$,(l,2,4)[e(l,2,3,4) + 

2^(1,3,4) +e(l,2)e(3, 4) + 
2^(1,3)^(2,4) +e(3, 4)] 

^ j $,(l,2,3)$fe(l,2,3)[e(l,2,3) + 3^(1,2) + 1]}. 
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For simplicity, in the above formula the order of ^ is notated with the number of 
arguments only, e.g., ^3(1,2,3) = ^(1,2,3). The above equation is is less obvi- 
ously useful then that of the two-point correlation function. Nevertheless, given 
a model for the higher order correlation functions, such as weakly non-linear per- 
turbation theory, or any well specified version of the hierarchical assumption, 
the equation can easily be turned into a practical computer program .p^ 

The variance of the four-point and higher order correlation functions can be 
calculated analogously, but it would make no sense to print the results. When 
needed, the formulae generated by computer algebra can be transformed into 
Fortran or C-code directly. 

4. Discussion 

The above method, and the explicit formulae given, can be used to evaluate 
the cosmic error on any statistical measure based on A^-tuples of a distribution. 
This includes, but is not limited to, A^-point correlation functions, A^-th order 
cumulants, cumulant correlators, multi-spectra, etc. 

The above calculation was performed only for the best A^-point estimator 
Any other estimator can be calculated analogously, but be warned that the re- 
sulting number of terms can be overwhelming due to the insufficient cancellation 
arising from suboptimal edge correction. 

The fact that the average density is assumed to be given as an outside 
parameter appears to be fairly restrictive. However, maximum likelihood con- 
text, which is the most important potential practical application of the results, 
it is easy to remedy the situation. The proposed estimator^ can be trivially 
changed by not normalizing with the average density A. This introduces only a 
small modification to the final formulae, and a set of estimators, including that 
of the the average count, contains all information need for constructing likeli- 
hood function. Such a procedure yields full statistical description, takes into 
account fluctuations in the mean, and the fact the average is estimated from the 
same surveys ( "integral constraint" ) . Practical demonstration of this procedure 
will be presented elsewhere. p*] 

The proposed estimator used here is not connected for A^ > 4.|^ Therefore 
the calculations for the higher order connected estimator should be modified for 
accurate results for the connected moments. This trivial but tedious generaliza- 
tion is left for future research. 

The explicit formulae can be specialized for several cases, which will be 
presented elsewhere. The interesting limits include Poisson, Gaussian, weakly- 
nonlinear, strong correlations, hierarchy, shot noise limited, continuous limit etc. 
The detailed discussion of these limits, and specializations to particular statis- 
tics, such as A'' — point correlation functions, multi-spectra, would go beyond 
the scope of the present exposition. Similarly, the main equation yields cross 
correlations between different statistics as well, a must for any investigation in 
the maximum likelihood framework. 

Finally, it is worth to note here, that recent advances in algorithms for cal- 
culating A^-point correlation functions render these objects more interesting then 
ever. Fast algorithms^ will make it possible to calculate A^-point functions from 
very large catalogs, be it artificial or real data, which undoubtedly will culminate 
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in new insights into the subject. The formulae presented in this proceedings will 
provide the firm theoretical grounding for any such investigation. 
Acknowledgments 

This work will form the integral part of a project in collaboration with S. 
Colombi, and A.S. Szalay soon to be published .p| 



References 



1. Bernstein, G.M. 1994. The variance of correlation function estimates. ApJ 

424: 569 

2. Boschan, P., Szapudi, I. & Szalay, A.S. 1994. On the accurate determination 

of the clustering hierarchy of galaxies. ApJS 93: 65 

3. Colombi S., Szapudi I., Szalay A.S. 1998. Effects of sampling on statistics of 

large-scale structure. MNRAS 296: 253 

4. Connolly, A.J., Nichol, R.C., Moore, A., & Szapudi, I., 2000. in preparation 

5. Daley, D.J., & Vere-Jones, D. 1972. In Stochastic Point Processes, Ed. Lewis, 

P.A.W. (New York: Wiley) 

6. Hamilton, A.J.S. 1993. Toward Better Ways to Measure the Galaxy Correla- 

tion Function. ApJ 417: 19 

7. Landy, S.D.,&: Szalay, A. 1993. Bias and variance of angular correlation func- 

tions. ApJ 412: 64 

8. Ripley, B.D. 1988. Statistical Inference for Spatial Processes (Cambridge: 

Cambridge Univ. Press) 

9. Szapudi I., & Colombi S., 1996. Cosmic Error and Statistics of Large-Scale 

Structure. ApJ 470: 131 

10. Szapudi, I., & Szalay, A.S. 1998. A New Class of Estimators for the N-Point 

Correlations. ApJ 494: L41 

11. Szapudi, I., Colombi, S., & Szalay, A.S. 2000. in preparation 



5. An Alternative Technique 

An alternative method of calculation is possible, which is instructive and insight- 
ful, even if less rigorous mathematically then the above formalism using factorial 
moment measures. This alternative technique is well suited for obtaining quick 
results for low order estimators by hand. We demonstrate the calculation for 
two-point correlation function, higher order results can be obtained analogously, 
although it quickly becomes prohibitevely tedious. 

Let us assume that the survey is divided into K bins, each of them with 
fluctuations 6i, with i running from 1 . . . K. For this configuration our estimator 
can be written as 

w = /i25i'52- (10) 
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The above equation uses a "shorthand" Einstein convention: 1,2 substituting 
for ii,i2, and repeated indices summed. The pairwise weights /12 correspond to 
<I> in the main body of the paper, and it is assumed that the two indices cannot 
overlap. 

The ensemble average of the above estimator is clearly /i2Ci2- The contin- 
uum limit (co)variance between bins a and b can be calculated by taking the 
square of the above, and taking the ensemble average. 

{Sui'^dw") = f^jl^ 5253^4) - {SlS2){635^)) . (11) 

Note that the averages in this formula are not connected moments, which are 
distinguished by ()c. 

The above equation yields only the continuum limit terms. To add Poisson 
noise contribution to the error, note that it arises from the possible overlaps 
between the indices (indices between two pairweights can still overlap!). In 
the spirit of infinitesimal Poisson models, we replace each overlap with a 1/A 
term, and express the results in terms of connected moments. There are three 
possibilities, i) no overlap (continuum limit) 

fufu (6234 + 63C24 + 6463) , (12) 

ii) one overlap (4 possibilities) 

^/f2/l'3(ei23+63), (13) 

iii) two overlaps (2 possibilities) 

^/f2/l2(l+62), (14) 

In these equations, for the sake of the Einstein convention we used k, I) — > 
6,ijki- The above substitutions (rigorously true only in the infinitesimal Poisson 
sampling limit) become increasingly accurate with decreasing cell size. In that 
limit, adding the above three equations is equivalent to Eq. ^. 
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